Take-Home Exercise 01

Water Points in Nigeria

Geospatial analytics hold tremendous potential to address complex problems facing society. The purpose of this study is to understand the spatial patterns of non-functional water points in Nigeria.

1.0 Introduction

Loading Packages

We will use the following packages:

  • sf: import geospatial datasets

  • tidyverse: manipulate aspatial data

  • spdep: compute spatial weights and autocorrelation

  • tmap: plot maps

  • funModeling: quick exploratory data analysis

Loading and Preparing Waterpoint Dataset

The water point data is collected by the Water Point Data Exchange (WPdx) whose goal is to improve water access to rural communities by providing data to enable data-driven decision making. The dataset can be found here and the data dictionary here. The data is in csv format with latitude and longitude information.

wp <- read_csv("data/WPdx_plus_Nigeria.csv")
Rows: 95008 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): #clean_adm1, #clean_adm2, #status_clean
dbl (2): #lat_deg, #lon_deg

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Due to the size of the dataset, it has already been pre-processed to keep only entries in Nigeria and some unused variables have been removed. The following code was used to pre-process the raw data file “Water_Point_Data_Exchange_-_Plus__WPdx__.csv” from the website but is not run on this page (raw data file is also not found on GitHub).

wp <- read_csv("data/Water_Point_Data_Exchange_-_Plus__WPdx__.csv") %>%
  filter(`#clean_country_name`=="Nigeria") %>%
  select(c(3:4, 13:14, 22)) %>%
  write_csv("data/WPdx_plus_Nigeria.csv")
glimpse(wp)
Rows: 95,008
Columns: 5
$ `#lat_deg`      <dbl> 7.980000, 6.964532, 6.486940, 6.727570, 6.779900, 6.95…
$ `#lon_deg`      <dbl> 5.120000, 3.597668, 7.929720, 7.648670, 7.664850, 7.77…
$ `#clean_adm1`   <chr> "Ekiti", "Ogun", "Ebonyi", "Enugu", "Enugu", "Benue", …
$ `#clean_adm2`   <chr> "Moba", "Obafemi-Owode", "Ohaukwu", "Isi-Uzo", "Isi-Uz…
$ `#status_clean` <chr> NA, "Functional", NA, NA, NA, NA, NA, NA, NA, NA, NA, …

We can see that the variable names have some special characters which is not ideal. There are also a lot of NA values in the variable of interest (#status_clean). The following code chunk cleans the variable names and replaces the NA values with “Unknown”.

wp <- wp %>%
  rename_with(~str_replace(.x, "#", "")) %>%
  mutate(status_clean=replace_na(status_clean, "Unknown"))

Let’s check the values of status_clean.

wp %>%
  group_by(status_clean) %>%
  summarise(n=n()) %>%
  ungroup() %>%
  kable() %>%
  kable_styling()
status_clean n
Abandoned 175
Abandoned/Decommissioned 234
Functional 45883
Functional but needs repair 4579
Functional but not in use 1686
Non-Functional 29385
Non-Functional due to dry season 2403
Non functional due to dry season 7
Unknown 10656

We can see that the categories are more detailed than we need. To study the proportion of functional and non-functional waterpoints, we can combine the categories into 3 categories: “Functional”, “Non-functional” and “unknown”

wp <- wp %>%
  mutate(status = case_when(
    status_clean %in% c("Abandoned/Decommissioned", 
                        "Abandoned",
                        "Non-Functional",
                        "Non functional due to dry season",
                        "Non-Functional due to dry season") ~ "Non-functional",
    status_clean == "Unknown" ~ "Unknown",
    status_clean %in% c("Functional", 
                        "Functional but not in use",
                        "Functional but needs repair") ~ "Functional"
  ))

Let’s visualise the proportions of functionality of the waterpoints. On the whole, only 55% of waterpoints are functional.

freq(wp, input="status")
Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.
ℹ The deprecated feature was likely used in the funModeling package.
  Please report the issue at <https://github.com/pablo14/funModeling/issues>.

          status frequency percentage cumulative_perc
1     Functional     52148      54.89           54.89
2 Non-functional     32204      33.90           88.79
3        Unknown     10656      11.22          100.00

Now, I convert the aspatial data into geospatial point data from the latitude and longitude data using the st_as_sf() function. The original GCS of the data is WGS1984 (EPSG:4326) from the data dictionary. We need to project it to the EPSG:26391 CRS.

wp_sf <- st_as_sf(wp, 
                  coords = c("lon_deg", "lat_deg"),
                  crs=4326) %>%
  st_transform(crs = 26391)

Loading Administrative Boundary Data

I will also use the Nigeria Level-2 Administrative Boundary (also known as Local Government Area) polygon dataset from geoBoundaries.

adm_bound <- st_read(dsn="data",
               layer="geoBoundaries-NGA-ADM2")
Reading layer `geoBoundaries-NGA-ADM2' from data source 
  `D:\lins-92\ISSS624\Take-home_EX01\data' using driver `ESRI Shapefile'
Simple feature collection with 774 features and 5 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 2.668534 ymin: 4.273007 xmax: 14.67882 ymax: 13.89442
Geodetic CRS:  WGS 84

The administrative boundary dataset also needs to be projected to the EPSG:26391 CRS.

adm_bound <- adm_bound %>%
  st_transform(crs = 26391)
glimpse(adm_bound)
Rows: 774
Columns: 6
$ shapeName  <chr> "Aba North", "Aba South", "Abadam", "Abaji", "Abak", "Abaka…
$ Level      <chr> "ADM2", "ADM2", "ADM2", "ADM2", "ADM2", "ADM2", "ADM2", "AD…
$ shapeID    <chr> "NGA-ADM2-72505758B79815894", "NGA-ADM2-72505758B67905963",…
$ shapeGroup <chr> "NGA", "NGA", "NGA", "NGA", "NGA", "NGA", "NGA", "NGA", "NG…
$ shapeType  <chr> "ADM2", "ADM2", "ADM2", "ADM2", "ADM2", "ADM2", "ADM2", "AD…
$ geometry   <MULTIPOLYGON [m]> MULTIPOLYGON (((552560.3 12..., MULTIPOLYGON (…

Let’s check what the Nigeria Level-2 Administrative Boundary and water points data looks like.

tmap_mode("plot")
tmap mode set to plotting
tm_shape(adm_bound) +
  tm_polygons() +
  tm_text("shapeName", size=0.2) +
tm_shape(wp_sf) +
  tm_symbols(size=0.1)

2 Status of Waterpoints

Now I have 2 geospatial datasets: a point dataset with waterpoint locations and a polygon data with administrative boundaries. I still need to count the number of waterpoints by status for each administrative area. Although the clean_adm2 variable in the waterpoint data should correspond to the administrative area the waterpoint belongs to, it is more accurate to use the GPS coordinates.

I will use the st_join() function to do a spatial join to relate the administrative area names to each waterpoint by its location. The join=st_intersects() argument tells R the type of spatial join to use.

wp_named <- st_join(x = wp_sf,
                    y = adm_bound,
                    join = st_intersects,
                    left = TRUE)

Next, we need check if there are any missing values.

sum(is.na(wp_named$shapeID))
[1] 29

We can plot the points to visually check why these points are missing the administrative area name. Most likely it is because they fall outside any administrative area. If so, we can safely ignore these points. Setting tmap_mode("view") creates an interactive plot so we can zoom in to check the points. In addition, tm_dots() is used instead of tm_shape() this time so that the size of each point scales when zooming in on the interactive map.

wp_named_nas <- wp_named%>%
  filter(is.na(shapeID))
tmap_mode("view")
tmap mode set to interactive viewing
tm_shape(adm_bound) +
  tm_polygons() +
tm_shape(wp_named_nas) +
  tm_dots(size=0.1,
          col="red")

We can see that these 29 points fall outside the boundary of Nigeria so we can exclude them.

Now let’s extract the number of waterpoints by status in each administrative boundary and join it to the administrative boundary layer. First, I need to remove the geometry data using the st_drop_geometry() function to manipulate it like a regular dataframe using tidyr and dplyr functions.

The next step is to group by administrative area name and status to generate the count. Lastly, we pivot from long to wide format for joining with the administrative boundary dataset. The values_fill=0 argument replaces any na values in the values_from variable with 0.

prop <- wp_named %>%
  st_drop_geometry() %>%
  group_by(shapeID, status) %>%
  summarise(n=n()) %>%
  ungroup() %>%
  pivot_wider(id_cols=shapeID,
              names_from=status,
              values_from=n, 
              values_fill=0)
`summarise()` has grouped output by 'shapeID'. You can override using the
`.groups` argument.
head(prop, n=5) %>%
  kable() %>%
  kable_styling()
shapeID Functional Non-functional Unknown
NGA-ADM2-72505758B10049836 31 39 0
NGA-ADM2-72505758B10063467 50 38 30
NGA-ADM2-72505758B10065661 104 51 23
NGA-ADM2-72505758B10302610 64 53 84
NGA-ADM2-72505758B11317593 51 49 0

Now, we use left_join() to relate the counts to the administrative boundary geospatial data. We also need to replace any na counts with 0 and add a new variable for total number of waterpoints.

adm_wp <- left_join(x=adm_bound,
                    y=prop,
                    by="shapeID") %>%
  mutate(across(c(6:8), ~replace_na(.x, 0))) %>%
  mutate(Total = Functional + `Non-functional` + Unknown)

Finally, we can plot the number of waterpoints by status in each administrative area.

tmap_mode("plot")
tmap mode set to plotting
total <- qtm(adm_wp, "Total")
func <- qtm(adm_wp, "Functional")
nonfunc <- qtm(adm_wp, "Non-functional")
unknown <- qtm(adm_wp, "Unknown")

tmap_arrange(total, func, nonfunc, unknown,
             asp=1, ncol=2, nrow=2)

The distribution of waterpoints across Nigeria does not appear to be evenly distributed. There are some small administrative areas with a high number of waterpoints (total and functional) in the north of Nigeria. There is one administrative area in the central west with a high number of waterpoints, but also a high number of non-functional waterpoints. Based on the distribution of waterpoints of unknown status, we can infer that the north of Nigeria is likely more developed because there are fewer waterpoints of unknown status; likewise, central Nigeria may not be as developed because there is a high number of water points of unknown or non-functional status.

The code chunk below plots total waterpoints using quantile breaks. We can also add a histogram to view the distribution of total waterpoints.

tm_shape(adm_wp)+
  tm_polygons("Total",
              style="quantile",
              palette="RdBu",
              legend.hist=TRUE) +
  tm_layout(main.title="Total Waterpoints in Nigeria",
            main.title.size=1.1,
            title.snap.to.legend=FALSE,
            legend.outside=TRUE,
            legend.hist.width = 1.1)

From this map, we can see that the north-east and south of Nigeria tend to have fewer waterpoints. More than 60% of administrative areas have les than 200 waterpoints. As we do not know the population or water demand of each administrative area, it is difficult to say which areas are water stressed or need additional water infrastructure.

The spatial distribution of waterpoint supply shows some relation to the climate classification and population distribution. The south of Nigeria has higher rainfall which could mean less reliance on man-made waterpoints and thus fewer waterpoints.Area with higher population also tend to have more waterpoints.

Koppen-Geiger climate classification map for Nigeria. Source: Beck, H. E., et al. (2018)

Population density in Nigeria. Source: Wikimedia Commons

tm_shape(adm_wp)+
  tm_polygons("Functional",
              style="quantile",
              palette="RdBu",
              legend.hist=TRUE) +
  tm_layout(main.title="Total Functional Waterpoints in Nigeria",
            main.title.size=1.1,
            title.snap.to.legend=FALSE,
            legend.outside=TRUE,
            legend.hist.width = 1.1)

tm_shape(adm_wp)+
  tm_polygons("Non-functional",
              style="quantile",
              palette="-RdBu",
              legend.hist=TRUE) +
  tm_layout(main.title="Total Non-Functional Waterpoints in Nigeria",
            main.title.size=1.1,
            title.snap.to.legend=FALSE,
            legend.outside=TRUE,
            legend.hist.width = 1.1)

We can also plot it as proportions of total water points. First, we need to generate new variables for proportions.

adm_wp <-adm_wp %>%
  mutate(pFunctional = Functional/Total,
         `pNon-functional` = `Non-functional`/Total,
         pUnknown = Unknown/Total)

The following plot shows the number of non-functional waterpoints out of total waterpoints by administrative area. We can see that there are some administrative areas where there some few waterpoints and most of them are non-functional.

ggplot(adm_wp) +
  geom_bar(aes(x=reorder(shapeID, `pNon-functional`, decreasing=TRUE), 
               y=Total,
               fill="Total"),
           stat="identity") +
    geom_bar(aes(x=reorder(shapeID, `pNon-functional`, decreasing=TRUE), 
               y=`Non-functional`,
               fill="Non-functional"),
           stat="identity",
           alpha=0.8) +
  scale_fill_manual(name="",
                    values=c("red", "gray30")) +
  labs(title="Number of Water by Administrative Area",
       subtitle="(sorted by proportion of non-functional)",
       y="Number of waterpoints",
       x="Administrative Areas")+
  theme(axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        axis.line.y=element_line(colour="grey50"))

Now let’s plot the proportions spatially.

total <- tm_shape(adm_wp)+
  tm_polygons("Total",
              style="quantile",
              palette="RdBu",
              title="")+
  tm_layout(main.title="Total waterpoints")

func <- tm_shape(adm_wp)+
  tm_polygons("pFunctional",
              style="quantile",
              palette="RdBu",
              title="")+
  tm_layout(main.title="Proportion functional")

nonfunc <- tm_shape(adm_wp)+
  tm_polygons("pNon-functional",
              style="quantile",
              palette="-RdBu",
              title="")+
  tm_layout(main.title="Proportion non-functional")

unknown <- tm_shape(adm_wp)+
  tm_polygons("pUnknown",
              style="quantile",
              palette="RdBu",
              title="")+
  tm_layout(main.title="Proportion unknown")

tmap_arrange(total, func, nonfunc, unknown,
             asp=1, ncol=2, nrow=2)

We can see that many administrative areas in the north have more waterpoints and a higher proportion of functional water points. Many states in the south part of Nigeria have few waterpoints a high proportion of non-functional waterpoints. These areas are likely more water-stressed and efforts to restore waterpoints can be concentrated in those areas.

In addition, there are some administrative areas in north-east tip of Nigeria without any waterpoints at all